── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ purrr 1.0.2
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.0 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(broom)
library(flextable)
Attaching package: 'flextable'
The following object is masked from 'package:purrr':
compose
options(scipen =999)
Purpose
While there are many projects around collating and exploring baby name data, I have yet to find one that looks at frequency (popularity) over time and country for all the countries I’m interested in.
There are some good and fairly comprehensive ones out there, but many of them don’t document their gathering and wrangling processes or are designed for different purposes. Here is mine.
When designing linguistic stimuli, sometimes we use gender stereotypes to probe comprehension processes. Sometimes we just need a bunch of names in order to vary the content of sentences without distracting from other aspects. Sometimes we need gender-ambiguous names. Sometimes we need gender-balanced names. Sometimes we need names that evoke certain beliefs or stereotypes to create stimuli that allow us to probe different syntactic, semantic, and social phenomena. Regardless, the social properties of names change over time as new stereotypes and distributions come to the forefront and are incorporated into individuals' understanding of the world.
My project aims to:
Collect information about the names (primarily) English-speaking people (in primarily English-speaking countries) are exposed to
Identify how exposure in peer groups and in parent/grandparent/child peer groups influences perceptions and stereotypes about names
Identify how views of gender influence perceptions and stereotypes about names
Investigate how perceptions and stereotypes about names influence related grammatical processes (e.g. coreference)
Investigate how perceptions and stereotypes about names influence unrelated grammatical processes (e.g. filled gap effects)
Provide a dataset and tools for building informed linguistics example sentences and stimuli
gender-fair name selection informed by participant age range
finding and selecting racially/ethnically representative names
generating age-appropriate lists of gender-balanced (unisex) names
identifying names perceived to be (more) ‘nonbinary’ or ‘binary’
identifying names perceived to be (more) ‘young’ or ‘old’ (relative to participants’ age)
identifying names marked for other stereotypes (e.g. race, country of origin, socio-economic class)
gathering this information by region and by age, as these stereotypes can vary internationally and generationally
Gathering raw data
USA
Hadley Wickham’s babynames package only goes to 2017, but that is an alternative for smaller scope projects
National Records of Scotland: Full list 1974-2022 here
read_csv("data/Scotland/babies-first-names-all-names-all-years.csv",
         show_col_types = FALSE) |>
  select(-position) |> # rank is not calculated the same way as for USA, position is a char string
  rename(year = yr, asab = sex, name = FirstForename) |>
  mutate(asab = case_when(asab == "B" ~ "M",
                          asab == "G" ~ "F",
                          .default = "X"),
         region = "Scotland",
         name = str_to_title(name)) |>
  group_by(year, asab) |>
  mutate(proportion = number / sum(number, na.rm = TRUE)) |>
  ungroup() -> names_Scotland
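The pipeline above derives each name's proportion by dividing its count by the summed counts within its year and ASAB group. Here is a minimal sketch of that grouping step, written in plain Python (which this notebook already uses for the Victoria JSON step) rather than R. The rows and the `add_proportions` helper are hypothetical illustrations, not real National Records of Scotland figures.

```python
from collections import defaultdict

# Hypothetical toy rows: (year, asab, name, number). Not real data.
rows = [
    (2000, "F", "Emma", 30),
    (2000, "F", "Alex", 10),
    (2000, "M", "Jack", 45),
    (2000, "M", "Alex", 15),
]


def add_proportions(rows):
    """Append number / group total, grouping by (year, asab)."""
    totals = defaultdict(int)
    for year, asab, _name, number in rows:
        totals[(year, asab)] += number
    return [
        (year, asab, name, number, number / totals[(year, asab)])
        for year, asab, name, number in rows
    ]


with_prop = add_proportions(rows)
for row in with_prop:
    print(row)
```

Note that the denominator here is the total of the listed names in each group, which may differ from the number of registered births if a source omits rare names.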
# Thanks to Hadas Kotek for parsing the JSON (3 November 2022)
import pandas as pd
import json
with open('victoria_babies_json_query_return.txt') as f:
    json_data = json.load(f)
pd.DataFrame(json_data['popular_baby_names']).to_csv('popular_baby_names.csv', index=False)
Australian Capital Territory
# ACT
# no data, too small (only releases brief statement)
New South Wales
# new south wales
names_nsw <- read_csv("data/Australia/NSW/popular_baby_names_1952_to_2022-v2.csv",
                      show_col_types = FALSE) |>
  rename(rank = Rank,
         name = Name,
         number = Number,
         asab = Gender,
         year = Year) |>
  mutate(region = "New South Wales",
         name = str_to_title(name),
         asab = case_when(asab == "Male" ~ "M",
                          asab == "Female" ~ "F",
                          .default = "X"))
These are likely errors in data entry but they do not seem to be recoverable as this is the data as provided. Since ?LIAM only occurs once, I believe it is safe to keep it in as noise, but since (NOT occurs 8 times in one year across both ASAB categories, I believe it should be removed. A brief exploration suggests there is not another word or words in 1968 in South Australia that have been erroneously included (i.e. whatever text followed ‘NOT’), but I am not 100% certain.
# south australia
read_csv(file = list.files(path = "data/Australia/SouthAustralia/Baby Names 1944-2013",
                           pattern = ".csv",
                           full.names = TRUE),
         col_names = TRUE,
         id = "file",
         show_col_types = FALSE) |>
  mutate(file = str_remove(file, "data/Australia/SouthAustralia/Baby Names 1944-2013/"),
         file = str_remove(file, "_top.csv")) |>
  rbind(# had to manually rename "Number" to "Amount" for 2016 files
    read_csv(file = list.files(path = "data/Australia/SouthAustralia",
                               pattern = ".csv",
                               full.names = TRUE),
             col_names = TRUE,
             id = "file",
             show_col_types = FALSE) |>
      mutate(file = str_remove(file, "data/Australia/SouthAustralia/"),
             file = str_remove(file, ".csv"),
             file = str_remove(file, "top"))) |>
  rename(name = "Given Name",
         number = "Amount") |>
  # several files were only top 100 names, not all names, so this throws a warning about file name
  separate(file, into = c("asab", "year"), sep = "_") |>
  group_by(year, asab) |>
  mutate(name = str_to_title(name),
         rank = rank(-number, ties.method = "first") |> as.integer(),
         asab = case_when(asab == "female" ~ "F",
                          asab == "male" ~ "M",
                          .default = "X"),
         region = "South Australia",
         year = str_sub(year, -4, -1) |> as.numeric()) |>
  select(-Position) |>
  filter(name != "TOTAL", # 2016 included "TOTAL" as if it were a name
         name != "Total", # 2016 included "TOTAL" as if it were a name
         name != "(Not") |> # REMOVE HIGHER FREQUENCY BAD DATA
  ungroup() -> names_soz

# western australia (hand-compiled from website)
#WA_babynames_1930_2022 <-
read_csv("data/Australia/WesternAustralia/WA_babynames_1930-2022.csv",
         show_col_types = FALSE) |>
  mutate(region = "Western Australia",
         name = str_to_title(name)) -> names_woz
Australia Combined
# combine australian data
names_nsw |>
  rbind(names_nt) |>
  rbind(names_queensland) |>
  rbind(names_soz) |>
  rbind(names_tasmania) |>
  rbind(names_victoria) |>
  rbind(names_woz) |>
  #-> names_Australia
  group_by(name, asab, year) |>
  summarise(number = sum(number),
            .groups = "drop") |>
  # join with detailed births information
  left_join(read_csv("data/Australia/ABS_BIRTHS_SUMMARY_1.0.0_4+5+1..A.csv",
                     show_col_types = FALSE) |>
              # messy data, needs cleaning
              rename(measure = `MEASURE: Measure`,
                     region2 = `REGION: Region`,
                     year = `TIME_PERIOD: Time Period`,
                     population = OBS_VALUE) |>
              select(measure, region2, year, population) |>
              filter(region2 == "AUS: Australia",
                     measure != "1: Births") |>
              pivot_wider(names_from = measure,
                          values_from = population),
            by = join_by("year")) |>
  mutate(proportion = case_when(asab == "M" ~ number / `4: Male births`,
                                asab == "F" ~ number / `5: Female births`),
         region = "Australia") |>
  group_by(year, asab) |>
  mutate(rank = rank(-number, ties.method = "first")) |>
  select(year, name, number, rank, asab, region, proportion) |>
  ungroup() |>
  # join with older historical data (missing ASAB; solution *estimate* by multiplying annual births by .5)
  left_join(read_csv("data/Australia/Births registered – 1934 to 2022(a).csv",
                     skip = 1,
                     show_col_types = FALSE) |>
              rename(year = Year,
                     population = `Births registered`),
            by = join_by("year")) |>
  mutate(proportion = case_when(is.na(proportion) ~ number / (population * .5), # no births by ASAB data, going by .5 of annual births
                                .default = proportion)) |>
  select(-population) -> names_Australia
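The proportion logic in the combined Australian data has two tiers: use births by ASAB where they exist, and otherwise estimate the denominator as half of all registered births for that year. A minimal sketch of that fallback in plain Python follows. The function name, its parameters, and the numbers are hypothetical, and the 0.5 split is the same rough estimate used in the R code rather than a measured sex ratio.

```python
def name_proportion(number, asab, male_births=None, female_births=None,
                    total_births=None):
    """Proportion of a year's births given a name, with a fallback denominator.

    Prefers births by ASAB; if those are missing, estimates the denominator
    as half of all registered births (an approximation, not a measured value).
    """
    births_for_asab = {"M": male_births, "F": female_births}.get(asab)
    if births_for_asab is not None:
        return number / births_for_asab
    if total_births is not None:
        return number / (total_births * 0.5)
    return None  # no usable denominator for this year


# Hypothetical figures for illustration only
exact = name_proportion(50, "F", female_births=1000)
estimated = name_proportion(50, "F", total_births=2000)
missing = name_proportion(50, "F")
```

Keeping the estimated and exact cases in one function makes it easy to flag which years rely on the approximation if that distinction matters later.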
Eventually, I would like to add India, China, Japan, Germany, Mexico, and perhaps other places that have large English-speaking populations or large immigrant communities in English-speaking countries. However, this requires resources beyond what I currently have.
Warning: Removed 91016 rows containing missing values or values outside the scale range
(`geom_text()`).
regional.breakdown |>
  filter(region == "New Zealand",
         #regional.log.ratio > 5
         ) |>
  ggplot(aes(x = regional.log.ratio, y = total_people, color = region)) +
  theme_bw() +
  geom_point(alpha = .5) +
  ggrepel::geom_text_repel(aes(label = name), max.overlaps = 40) +
  #scale_y_log10() +
  #facet_wrap(~region, scales = "free_y") +
  scale_x_continuous(breaks = seq(from = 0, to = 16, by = 1)) +
  NULL
Warning: Removed 712 rows containing missing values or values outside the scale range
(`geom_point()`).
Warning: Removed 712 rows containing missing values or values outside the scale range
(`geom_text_repel()`).
regional.breakdown |>
  filter(region == "Australia",
         #regional.log.ratio > 5
         ) |>
  ggplot(aes(x = regional.log.ratio, y = total_people, color = region)) +
  theme_bw() +
  geom_point(alpha = .5) +
  #ggrepel::geom_text_repel(aes(label = name), max.overlaps = 40) +
  #scale_y_log10() +
  #facet_wrap(~region, scales = "free_y") +
  scale_x_continuous(breaks = seq(from = 0, to = 16, by = 1)) +
  NULL
Warning: Removed 53490 rows containing missing values or values outside the scale range
(`geom_point()`).
regional.breakdown |>
  filter(region == "Canada",
         #regional.log.ratio > 5
         ) |>
  ggplot(aes(x = regional.log.ratio, y = total_people, color = region)) +
  theme_bw() +
  geom_point(alpha = .5) +
  #ggrepel::geom_text_repel(aes(label = name), max.overlaps = 40) +
  #scale_y_log10() +
  #facet_wrap(~region, scales = "free_y") +
  scale_x_continuous(breaks = seq(from = 0, to = 16, by = 1)) +
  NULL
Warning: Removed 14716 rows containing missing values or values outside the scale range
(`geom_point()`).
List of names for Name Explorer ratings
In order to elicit ratings of the names in this dataset for research and norming purposes, I need to create a list of all the names without any repeats. I will format it for use in Gorilla (online survey tool).
Here is a table of all the names and their frequencies in the database, rounded to the nearest 100. Frequency here is the number of entries for a name, that is, how many year-by-region records it appears in.
names_combined |>
  group_by(name) |>
  summarise(count = n(),
            .groups = "drop") |>
  filter(count > 3) |> # old spreadsheet
  mutate(freq_block = round(count, digits = -1), # old spreadsheet
         #freq_block = case_when(count <= 100 ~ round(count, digits = -1),
         #                       .default = round(count, digits = -2))
         ) |>
  select(-count) |>
  filter(freq_block >= 100) |> # old spreadsheet
  mutate(display = "Rating",
         freq_block = round(freq_block, digits = -2)) |> # added this in, not sure if it does re-generate original though
  select(display, name, freq_block) -> original_gorilla; original_gorilla |>
  arrange(-freq_block)
#Name_Rating_Spreadsheet; Name_Rating_Spreadsheet

# Old spreadsheet, with only 100s and greater, no breakdown below 100, filter(count > 3)
#write_csv(Name_Rating_Spreadsheet, "data/large_files/Name_Rating_Spreadsheet.csv")
Second Gorilla attempt, log ratio based
Attempt a different type of spreadsheet for Gorilla.
# combine three types of names to create a subset of names to rate
rbind(
  # these are the most frequent names with HIGH AMAB ratios
  combined.breakdown |>
    filter(log.ratio > 3) |>
    mutate(freq = log(total_amab)) |>
    mutate(freq_block = round(freq, digits = 0),
           ppl_per_year = total_amab / years) |>
    #group_by(freq_block) |> summarise(n = n())
    arrange(ppl_per_year) |>
    filter(freq > 11) |>
    mutate(type_rank = rank(-log.ratio, ties.method = "first"),
           type = "amab-biased"),
  # these are the most frequent names with HIGH AFAB ratios
  combined.breakdown |>
    filter(log.ratio > 3) |>
    mutate(freq = log(total_afab)) |>
    mutate(freq_block = round(freq, digits = 0),
           ppl_per_year = total_afab / years) |>
    #group_by(freq_block) |> summarise(n = n())
    arrange(ppl_per_year) |>
    filter(freq > 10.9) |> # just to get Ashleigh in the list, if I'm honest...
    mutate(type_rank = rank(-log.ratio, ties.method = "first"),
           type = "afab-biased")) |>
  rbind(
    # these are the most frequent names with HIGHLY BALANCED ratios
    combined.breakdown |>
      filter(log.ratio <= 3) |>
      mutate(freq = log(total_people)) |>
      mutate(freq_block = round(freq, digits = 0),
             ppl_per_year = total_people / years) |>
      #group_by(freq_block) |> summarise(n = n())
      arrange(ppl_per_year) |>
      filter(freq > 9) |>
      mutate(type_rank = rank(log.ratio, ties.method = "first"),
             type = "equi-biased")) -> gorilla_subset; gorilla_subset

# New spreadsheet with amab/afab-biased and equi-biased names
#write_csv(Name_Rating_Spreadsheet, "data/large_files/Name_Rating_Spreadsheet_3Types.csv")
Warning: ggrepel: 1292 unlabeled data points (too many overlaps). Consider
increasing max.overlaps
List of names for SEPTA Name Norming experiment
The SEPTA Name Norming experiment used only the USA data, but this code generates the list for all regions, for an expanded and more detailed list.
Primary question: Of the names that are assigned to both AMAB and AFAB babies, which ones occur at very close to 1:1 ratios?
The main considerations here are finding names that are consistently assigned in a balanced way, over the years, but not over-privileging names that have been attested for many more years or in particular regions, as different regions report over different timeframes. Furthermore, some names have rapidly shifted “gender valence”, so there are many possible ways to assign rankings. The logic of this method is described within the code annotations.
names_combined |>
  # Include this line to recreate the Name Norming stimuli from this database
  # filter(region == "USA") |>
  # Put attestations of AMAB and AFAB babies in adjacent columns to calculate ratios, etc
  pivot_wider(names_from = "asab",
              values_from = c(number, proportion, rank),
              # Unattested cells are given the value 0, since there are no babies with that name in that year.
              values_fill = 0) |>
  # Since dividing by 0 produces an infinite number, only finite ratios are calculated.
  # All infinite values are given the value 0, which is equivalent for later calculations
  mutate(ratio_numrFM = case_when(number_M != 0 ~ number_F / number_M,
                                  .default = 0),
         # Calculate the absolute value of the log ratio as a measure of 'balancedness'
         log_ratio_numrFM = log(ratio_numrFM) |> abs()) |>
  # For all infinite log ratios (when the ratio was assigned a value of 0), remove value (assign NA)
  mutate_if(is.numeric, list(~na_if(., Inf))) |>
  # Remove NAs and placeholder baby names
  filter(!is.na(log_ratio_numrFM),
         !name %in% c("Baby", "Infant", "Notnamed", "Unknown", "Unkown", "Unnamed")) |>
  group_by(name) |>
  # calculations for ranking appropriateness of each name
  summarise(
    # how many babies are recorded as being given each name overall
    total_F = sum(number_F, na.rm = TRUE),
    total_M = sum(number_M, na.rm = TRUE),
    # how many years in each region recorded this name?
    entries = n(),
    # how many regions is the name attested in
    regions = n_distinct(region),
    # how many babies are given this name per year per region?
    mean_F = mean(number_F, na.rm = TRUE) / n_distinct(region),
    mean_M = mean(number_M, na.rm = TRUE) / n_distinct(region),
    # Do not calculate the mean log ratio this way, but I don't know why yet
    ##mean.logratio2 = mean(abs(log(mean_F/mean_M)), na.rm = TRUE),
    # Take the mean of the absolute log ratio values for each name
    mean.logratio = mean(log_ratio_numrFM, na.rm = TRUE),
    # Medians, less affected by outliers
    median.logratio = median(log_ratio_numrFM, na.rm = TRUE),
    # Standard deviation to measure the variance, which will be greater for names that
    # changed valence than those with more static balancedness
    sd.logratio = sd(log_ratio_numrFM, na.rm = TRUE),
    # Standard error to measure certainty of the mean value
    se.logratio = sd(log_ratio_numrFM, na.rm = TRUE) / sqrt(n()),
    # Maximum log ratio to set a threshold for how large the ratios could be in a given year
    max.logratio = max(log_ratio_numrFM, na.rm = TRUE),
    # Minimum log ratio to prioritize names that had perfectly balanced years
    min.logratio = min(log_ratio_numrFM, na.rm = TRUE),
    .groups = "drop") |>
  mutate(
    # mean number of babies per region per year across ASABs
    mean.asab = (mean_F + mean_M) / 2,
    # babies per year per region, redistributed to
    weighted.frequency = mean.asab * entries * regions) |>
  # arrange focusing on range of log ratios, for dealing with tie-breaking
  arrange(min.logratio, max.logratio, sd.logratio, -weighted.frequency, -mean.asab, -regions, -entries) |>
  # ranks based on range of log ratio
  mutate(order_logratio = rank(min.logratio, ties.method = "first")) |>
  # arrange focusing on average values of log ratio, for dealing with tie-breaking
  arrange(mean.logratio, median.logratio, se.logratio, -weighted.frequency) |>
  # ranks based on average log ratio values
  mutate(order_mean = rank(mean.logratio, ties.method = "first")) |>
  # arrange focusing on commonness, for dealing with tie-breaking
  arrange(-weighted.frequency, sd.logratio) |>
  # rank based on commonness
  mutate(order_common = rank(-weighted.frequency, ties.method = "first"),
         # add previous three ranks to create new ranking value
         combined_order = order_logratio + order_mean + order_common * 3, # 3 seems to work across regions?
         # re-rank based on combined rank for simplicity
         rank_order = rank(combined_order, ties.method = "first")) |>
  # rearrange based on final ranking, for display purposes
  arrange(rank_order) -> names_balanced; names_balanced
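The per-year 'balancedness' measure computed above is the absolute value of the log of the AFAB-to-AMAB count ratio: 0 means a perfect 1:1 split, and a 2:1 skew scores the same in either direction. Years where either count is 0 give an infinite log ratio and are dropped. Here is a minimal sketch of that single step in plain Python; the function name and example counts are hypothetical.

```python
import math


def abs_log_ratio(number_f, number_m):
    """Absolute log ratio of AFAB to AMAB counts for one name in one year.

    Returns None when either count is 0, mirroring the step above where
    infinite log ratios are converted to NA and then filtered out.
    """
    if number_f == 0 or number_m == 0:
        return None
    return abs(math.log(number_f / number_m))


# Hypothetical counts for illustration only
balanced = abs_log_ratio(100, 100)      # perfectly balanced year
skew_f = abs_log_ratio(200, 100)        # twice as many AFAB babies
skew_m = abs_log_ratio(100, 200)        # twice as many AMAB babies
one_sided = abs_log_ratio(0, 50)        # unattested for one ASAB
```

Taking the absolute value is what makes the measure symmetric, so averaging it over years rewards names that stay close to 1:1 rather than names whose skews in opposite directions happen to cancel out.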
HERE BE DRAGONS
Briefly, here are some tests of the dataset to demonstrate its utility.
Unisex or gender-balanced names
The names in unisex_names are a complete list of names that occur at least once in both AMAB and AFAB entries. This does not account for data entry errors or other noise in the datasets.
From the combined names dataset, I separated and expanded the names to compare the number of attestations of a name across se
names_balanced$name[1:100] -> unisex_names
Of the unisex_names, which ones occur in at least 0.05% of the population? This helps control for wildly different population sizes of each country, but also helps weed out noise from data entry errors.
However, I believe the large population of the USA and some possible larger-scale data entry errors could be boosting names like “Mary” and “Samantha” into this list. It is worth exploring in the future.
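The 0.05% cut-off corresponds to a proportion of 0.0005, applied to both the AMAB and the AFAB proportion for a given name, year, and region. A minimal sketch of that check in plain Python; the constant and function names are hypothetical, and treating a missing proportion as a failure is an assumption that matches filtering out NA values first.

```python
THRESHOLD = 0.0005  # 0.05% expressed as a proportion


def passes_threshold(proportion_m, proportion_f, threshold=THRESHOLD):
    """True only if both proportions are present and strictly above the threshold."""
    if proportion_m is None or proportion_f is None:
        return False
    return proportion_m > threshold and proportion_f > threshold


# Hypothetical proportions for illustration only
both_common = passes_threshold(0.001, 0.0006)
one_rare = passes_threshold(0.001, 0.0004)
unattested = passes_threshold(None, 0.001)
```

Because the test is on proportions rather than raw counts, a name given to a handful of babies in a small region can pass while the same count in the USA would not, which is the intended control for population size.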
what was i doing here? looks like creating datasets to label plots (see sandbox)
names_combined |>
  filter(name %in% c(pull(.data = international_unisex_names, var = `7`),
                     pull(.data = international_unisex_names, var = `8`))) |>
  pivot_wider(names_from = "asab",
              values_from = c(number, proportion, rank)) |>
  ggplot(aes(x = proportion_M,
             y = proportion_F,
             color = region,
             label = region)) +
  geom_abline(slope = 1, intercept = 0) +
  geom_path(aes(group = region)) +
  ggrepel::geom_text_repel(aes(label = year),
                           size = 2,
                           color = "black",
                           alpha = .5,
                           max.overlaps = 20) +
  xlab("proportion of the AMAB population registered with this name in a given year") +
  ylab("proportion of the AFAB population registered with this name in a given year") +
  scale_x_log10(labels = scales::percent_format()) +
  scale_y_log10(labels = scales::percent_format()) +
  #scale_color_manual(values = c("red","orange","gold","green3","cyan2","blue","blueviolet","violet")) +
  scale_color_viridis_d(option = "turbo") +
  facet_wrap(~name, scales = "free", ncol = 5) +
  NULL
names_combined |>
  filter(name %in% c(pull(.data = international_unisex_names, var = `7`),
                     pull(.data = international_unisex_names, var = `8`))) |>
  pivot_wider(names_from = "asab",
              values_from = c(number, proportion, rank)) |>
  ggplot(aes(x = number_M,
             y = number_F,
             color = region,
             label = region)) +
  geom_abline(slope = 1, intercept = 0) +
  geom_path(aes(group = region)) +
  ggrepel::geom_text_repel(aes(label = year),
                           size = 2,
                           color = "black",
                           alpha = .5,
                           max.overlaps = 20) +
  # axes show raw counts (number_M, number_F), so the labels say "number" rather than "proportion"
  xlab("number of AMAB babies registered with this name in a given year") +
  ylab("number of AFAB babies registered with this name in a given year") +
  scale_x_log10() +
  scale_y_log10() +
  #scale_color_manual(values = c("red","orange","gold","green3","cyan2","blue","blueviolet","violet")) +
  scale_color_viridis_d(option = "turbo") +
  facet_wrap(~name, ncol = 5) + #, scales = "free"
  NULL
names_combined |>
  pivot_wider(names_from = "asab",
              values_from = c(number, proportion, rank)) |>
  filter(name %in% unisex_names,
         !is.na(proportion_M),
         !is.na(proportion_F),
         proportion_M > 0.0005,
         proportion_F > 0.0005,
         name != "Mary") |>
  arrange(region, name, year) |>
  filter(name %in% top.names.combined) |>
  group_by(region, name) |>
  mutate(n = n()) |>
  ungroup() |>
  filter(n > 1) |>
  ggplot(aes(x = proportion_M,
             y = proportion_F,
             color = region,
             label = region)) +
  geom_abline(slope = 1, intercept = 0) +
  geom_path(aes(group = region)) +
  ggrepel::geom_text_repel(aes(label = year),
                           size = 2,
                           color = "black",
                           alpha = .5,
                           max.overlaps = 20) +
  facet_wrap(~name, scales = "free") +
  xlab("proportion of the AMAB population registered with this name in a given year") +
  ylab("proportion of the AFAB population registered with this name in a given year") +
  scale_x_continuous(labels = scales::percent_format()) +
  scale_y_continuous(labels = scales::percent_format()) +
  scale_color_manual(values = c("red", "orange", "gold", "green3", "cyan2", "blue", "blueviolet", "violet")) +
  NULL